Human visual explanations mitigate bias in AI-based assessment of surgeon skills

Artificial intelligence (AI) systems can now reliably assess surgeon skills through videos of intraoperative surgical activity. With such systems informing future high-stakes decisions such as whether to credential surgeons and grant them the privilege to operate on patients, it is critical that they treat all surgeons fairly. However, it remains an open question whether surgical AI systems exhibit bias against surgeon sub-cohorts, and, if so, whether such bias can be mitigated. Here, we examine and mitigate the bias exhibited by a family of surgical AI systems (SAIS) deployed on videos of robotic surgeries from three geographically diverse hospitals (USA and EU). We show that SAIS exhibits an underskilling bias, erroneously downgrading surgical performance, and an overskilling bias, erroneously upgrading surgical performance, at different rates across surgeon sub-cohorts. To mitigate such bias, we leverage a strategy (TWIX) which teaches an AI system to provide a visual explanation for its skill assessment that otherwise would have been provided by human experts. We show that whereas baseline strategies inconsistently mitigate algorithmic bias, TWIX can effectively mitigate the underskilling and overskilling bias while simultaneously improving the performance of these AI systems across hospitals. We discovered that these findings carry over to the training environment where we assess medical students’ skills today. Our study is a critical prerequisite to the eventual implementation of AI-augmented global surgeon credentialing programs, ensuring that all surgeons are treated fairly.


Data splits
To evaluate the performance of SAIS in assessing the skill-level of surgical activity, we used 10-fold Monte Carlo cross-validation. In this section, we outline the training, validation, and test splits for each of those folds when SAIS was tasked with assessing the skill-level of needle handling (Table 1 left) and needle driving (Table 1 right). Each sample reflects a video on the order of 10 to 30 seconds in duration. With skill assessment being a binary classification task (low-skill vs. high-skill), we balance the number of samples from each class in every data split (training, validation, and test). Doing so during training ensures that the model's performance is not biased towards the majority class, whereas balancing the classes during evaluation (e.g., on the test set) allows for a better understanding of the performance of SAIS and an appreciation of the evaluation metrics we report. For example, with a balanced test set (50:50 split between low-skill and high-skill activity), the area under the receiver operating characteristic curve becomes a more meaningful metric of performance.

Table 1. [...] and surgeons (s) in each fold and data split at USC. We used these samples in the 10-fold Monte Carlo cross-validation setup to train and evaluate SAIS in assessing the skill-level of needle handling (left) and needle driving (right).
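The class-balanced Monte Carlo cross-validation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `balanced_monte_carlo_splits`, the 70/10/20 split fractions, and the sample-level shuffling are all assumptions (the source specifies only the 10 folds and the class balancing, and the actual splits may be formed at the level of surgical cases or surgeons).

```python
import random
from collections import defaultdict


def balanced_monte_carlo_splits(labels, n_folds=10, fractions=(0.7, 0.1), seed=0):
    """Yield (train, val, test) index lists with equal class counts per split.

    `fractions` gives the assumed train and validation shares; the
    remainder of each class goes to the test split.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Downsample every class to the size of the smallest one.
    n_per_class = min(len(indices) for indices in by_class.values())
    n_train = int(fractions[0] * n_per_class)
    n_val = int(fractions[1] * n_per_class)

    rng = random.Random(seed)
    for _ in range(n_folds):
        train, val, test = [], [], []
        for indices in by_class.values():
            # A fresh random draw per fold is what makes this "Monte Carlo".
            picked = rng.sample(indices, n_per_class)
            train += picked[:n_train]
            val += picked[n_train:n_train + n_val]
            test += picked[n_train + n_val:]
        yield train, val, test


# Hypothetical labels: 0 = low skill, 1 = high skill.
labels = [0] * 30 + [1] * 50
folds = list(balanced_monte_carlo_splits(labels))
```

Because each fold is an independent random draw, a given sample can appear in the test split of several folds, unlike in standard k-fold cross-validation.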

SAIS exhibits an overskilling bias
We showed that SAIS exhibits an underskilling bias, erroneously downgrading surgical performance. Here, we provide evidence that SAIS also exhibits an overskilling bias, erroneously upgrading surgical performance (Supplementary Fig. 1). This is evident from the discrepancy in the positive predictive value (PPV) across the different surgeon sub-cohorts.
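To make the PPV discrepancy concrete, the sketch below computes the PPV separately for each surgeon sub-cohort. It assumes that high skill is the positive class (label 1), so that an overskilling error is a false positive. The function name `ppv_by_group`, the group names, and the toy labels are hypothetical, and the max-minus-min gap is only one simple way to summarise a discrepancy.

```python
from collections import defaultdict


def ppv_by_group(y_true, y_pred, groups):
    """Return {group: PPV}, treating high skill (1) as the positive class.

    Groups with no positive predictions are omitted because their PPV
    is undefined.
    """
    tp = defaultdict(int)
    fp = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        if p == 1:
            if t == 1:
                tp[g] += 1
            else:
                fp[g] += 1  # overskilling: low-skill activity rated as high
    return {g: tp[g] / (tp[g] + fp[g]) for g in set(tp) | set(fp)}


# Hypothetical toy data for two sub-cohorts.
y_true = [1, 0, 1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 1, 0, 1, 1, 1, 1]
groups = ["novice"] * 4 + ["expert"] * 4

ppv = ppv_by_group(y_true, y_pred, groups)
discrepancy = max(ppv.values()) - min(ppv.values())
```

The same pattern applies to the negative predictive value for underskilling, by counting true and false negatives among the low-skill predictions instead.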

Multi-class skill assessment systems continue to exhibit bias
We demonstrated that a binary surgeon skill assessment system (SAIS) exhibits both an underskilling and overskilling bias.

Implementation details
Here, we train SAIS from scratch to perform multi-class skill assessment (low vs. intermediate vs. high skill) and assess the degree of its algorithmic bias. This is made possible by trained raters who had previously provided such annotations by following the strict set of criteria in the skill assessment taxonomy. For training and evaluation of the AI system, we follow the same strategy as that outlined in the Methods section. Namely, we adopt a 10-fold Monte Carlo cross-validation setup where we balance the number of video samples from each class (both during training and evaluation).

Evaluation metrics
Because this is a multi-class setup, we have to be careful about the evaluation metrics used to quantify the underskilling and overskilling bias. To remain consistent with their definitions (see Results), we use the indicated elements of the confusion matrix (Supplementary Fig. 2, left) to calculate the degree to which underskilling or overskilling occurs. In other words, we define underskilling as having occurred if the AI system predicts a skill lower than the true skill. This includes predicting a low skill for a true intermediate or high skill, and predicting an intermediate skill for a true high skill. These cases correspond to the upper triangular region of the confusion matrix. Applying the same logic to overskilling, the rate at which it occurs can be gleaned from the lower triangular portion of the confusion matrix. We normalize these values by the total number of predictions for a particular surgeon sub-cohort and present them in Supplementary Fig. 2 (right).
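The triangular-region definition above can be sketched in a few lines. The layout of the confusion matrix is an assumption here: rows index the predicted skill and columns the true skill, which is the orientation under which "predicted lower than true" falls in the upper triangle, as described in the text. The function name `skilling_rates` and the toy labels are hypothetical.

```python
def skilling_rates(y_true, y_pred, n_classes=3):
    """Return (underskilling_rate, overskilling_rate) for ordinal labels.

    Labels are integers with 0 = low, 1 = intermediate, 2 = high skill.
    Assumed layout: cm[predicted][true].
    """
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[p][t] += 1

    total = sum(sum(row) for row in cm)
    cells = [(i, j) for i in range(n_classes) for j in range(n_classes)]
    # Upper triangle (row < column): predicted skill below true skill.
    under = sum(cm[i][j] for i, j in cells if i < j)
    # Lower triangle (row > column): predicted skill above true skill.
    over = sum(cm[i][j] for i, j in cells if i > j)
    return under / total, over / total


# Hypothetical predictions for one surgeon sub-cohort.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 0, 1, 2, 2, 1]
under_rate, over_rate = skilling_rates(y_true, y_pred)
```

Calling the function once per sub-cohort and comparing the resulting rates gives the discrepancy that the text interprets as an underskilling or overskilling bias.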

Findings
We found that such a multi-class system continues to exhibit an underskilling and overskilling bias, emphasizing the need for bias mitigation strategies to alleviate this issue. Note that the degree of bias exhibited by this system cannot be directly compared to the bias exhibited by the binary skill assessment system for two reasons. First and foremost, the two systems are evaluated on distinct datasets (due to the inclusion of video samples with an intermediate skill label). Second, although the evaluation metrics are similar in spirit in that they both capture either an underskilling or overskilling bias, they remain distinct from one another (e.g., discrepancy in underskilling in the multi-class setting, and discrepancy in negative predictive value in the binary setting).

Supplementary Fig. 2. Multi-class skill assessment system continues to exhibit algorithmic bias. (left) A confusion matrix reflecting underskilling and overskilling predictions for multi-class skill assessment. (right) SAIS is tasked with assessing the skill-level of needle handling on data from USC. A discrepancy in the rate of underskilling reflects an underskilling bias whereas a discrepancy in the rate of overskilling reflects an overskilling bias (see Evaluation metrics for details). To examine bias, we stratify SAIS' performance based on the total number of robotic surgeries performed by a surgeon during their lifetime (caseload), the volume of the prostate gland, and the severity of the prostate cancer (Gleason score). The results are an average across 10 folds and the error bars represent one standard error.

TWIX can mitigate overskilling bias across hospitals
We demonstrated that TWIX can mitigate the underskilling bias exhibited by SAIS. Having shown that SAIS also exhibits an overskilling bias, we explored whether TWIX can also mitigate this bias. To do so, we present the percent change in the worst-case PPV after adopting TWIX during the training of SAIS (Supplementary Fig. 3). We found that TWIX can mitigate the overskilling bias across hospitals. This is evident from the improvement in the worst-case PPV for the different surgeon groups at USC and SAH.

Supplementary Fig. 3. We present the average performance of SAIS on the most disadvantaged sub-cohort (worst-case NPV) before and after adopting TWIX, indicating the percent change. An improvement (↑) in the worst-case NPV is considered bias mitigation. SAIS is tasked with assessing the skill-level of a, needle handling and b, needle driving. Note that SAIS is trained on data from USC and deployed on data from St. Antonius Hospital and Houston Methodist Hospital. Results are an average across 10 folds.
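The percent change in worst-case performance can be illustrated with a short sketch. It assumes that "worst-case" means the minimum of a metric (PPV or NPV) over the sub-cohorts of a given stratification, and that the percent change is taken relative to the value before mitigation. The function name and the numbers are hypothetical and are not results from the study; the averaging across 10 folds is omitted.

```python
def worst_case_percent_change(before, after):
    """Percent change in the worst-case (minimum) metric across sub-cohorts.

    `before` and `after` map sub-cohort names to a metric such as PPV.
    The worst-off sub-cohort may differ before and after mitigation.
    """
    worst_before = min(before.values())
    worst_after = min(after.values())
    return 100.0 * (worst_after - worst_before) / worst_before


# Hypothetical PPV per caseload sub-cohort, before and after mitigation.
ppv_before = {"low_caseload": 0.60, "high_caseload": 0.80}
ppv_after = {"low_caseload": 0.66, "high_caseload": 0.78}

change = worst_case_percent_change(ppv_before, ppv_after)
# A positive value indicates that the most disadvantaged sub-cohort improved.
```

Tracking the minimum rather than the mean means that a strategy counts as mitigating bias only if it helps the sub-cohort that was served worst, even when another sub-cohort loses a little performance.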

We measured the effectiveness of two additional strategies in mitigating the bias exhibited by SAIS. These two strategies, additional data (AD) and surgical video pre-training (VP), are described in detail in the Methods section. We present the change in the worst-case performance (either NPV or PPV) before and after adopting these two strategies for the task of needle handling skill assessment at USC (Supplementary Fig. 4). We found that while AD and VP do mitigate the underskilling bias, and even more so than TWIX (see Results), they exacerbate the overskilling bias. This is evident from the improvement in the worst-case NPV and a simultaneous reduction in the worst-case PPV after adopting these strategies. These findings emphasize the importance of considering the collateral damage of a bias mitigation strategy: how does it negatively affect other types of bias?

Supplementary Fig. 4. Other bias mitigation strategies mitigate underskilling bias yet with collateral damage. We present the average performance of SAIS on the most disadvantaged sub-cohort (worst-case NPV and PPV) before and after adopting two different bias mitigation strategies (top: AD and bottom: VP), indicating the percent change. An improvement (↑) in the worst-case NPV or PPV is considered bias mitigation. SAIS is tasked with assessing the skill-level of needle handling at USC. Results are an average across 10 folds.